This 'Introduction to R for Data Science' is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1020 was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
This lesson is the third in a 6-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results.
The structure of the class is a code-along style: It is fully hands on. Prior to each lecture, the materials will be emailed to you and will also be available for download at QUERCUS, so you can spend more time coding than taking notes.
At the end of this session you will know the principles of tidy data, and be able to subset and transform your data to perform simple calculations. You will be able to add new rows and columns to your data frame.
Today we are going to be learning about tidy data and how it makes data analysis less of a pain. We will perform some basic statistics on our newly transformed dataset. The next lesson will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots - from simple data exploration to interactive plots - this is always a fun lesson.
grey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
Wide and long (sometimes un-stacked and stacked, or wide and tall, wide and narrow), are terms used to describe how a database is formatted.
In wide format, variables may be listed in the first column, each forming a row of its own. Observations may be presented as columns that fill observed values for each variable.
In long format, each variable is its own column, and the results of each measured variable are recorded in rows. In data science, long format is preferred over wide format because it allows for an easier and more efficient subset and manipulation of the data.
To read more about wide and long formats, visit https://eagereyes.org/basics/spreadsheet-thinking-vs-database-thinking.
Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene
16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs). Operational Taxonomic Units (OTUs) are groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, organic matter composition were also recorded.
An OTU-taxa table. (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.
B Torondel, JHJ Ensink, O Gunvirusdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334
This dataset is the result of 16S rRNA gene amplicon sequencing of samples from microbial communities cultured in fresh, brackish, or saline media. Treatments received the aromatic compounds toluene or pyrene as the sole source of carbon and energy. Controls did not receive any compounds (substrate-free) to account for any alternative carbon sources present in the media. The objective of this experiment was to evaluate which microorganisms would make use of toluene and pyrene.
Taxa table with gene sequences (ASV) and sample information. This file must be converted from wide to long format before use.
tidyverse (tidyr, dplyr, tibble)
gapminder
Some of these packages should already be installed into your Anaconda base from Lecture 02. If not, please review that lesson and load these packages. Remember to please install these packages from the conda-forge channel of Anaconda.
To install gapminder in the Anaconda prompt, use the command:
conda install conda-forge r-gapminder
#install.packages("gapminder", dependencies = TRUE)
# load our libraries
library(readr)
library(tidyverse)
library(gapminder)
Note that 8 different packages are loaded, and that 2 functions from the stats package have been replaced by functions of the same name by dplyr. Note that you can still access the stats version of the function by calling it directly as stats::filter().
In this lesson we want to answer 3 simple questions:
Which latrine depth has the greatest mean number of OTUs? (remember the Country_LatrineNo_Depth site encoding)
Is there more Clostridia in Tanzania or Vietnam?
Which site had the greatest number of Taxa represented?
Last lesson, we learned how to filter and select data subsets we were interested in. However, we can make data manipulation more efficient by controlling the overall structure or format of our data.
Let's read in our dataset, store it in a variable, and remind ourselves about the original structure.
# Always check where you are first!
getwd()
list.files("data/")
# Notice the difference between tidyverse's read_csv() and base R's read.csv().
# The default way to read-in the file is also different from read.csv()
data <- read_csv("data/taxa_pitlatrine_wide.csv")
# Take a look at the head and structure of your data
head(data)
str(data)
-- Column specification -------------------------------------------------------- cols( .default = col_double(), Taxa = col_character() ) i Use `spec()` for the full column specifications.
| Taxa | Tanzania_2_1 | Tanzania_2_10 | Tanzania_2_12 | Tanzania_2_2 | Tanzania_2_3 | Tanzania_2_6 | Tanzania_2_7 | Tanzania_2_9 | Tanzania_3_2 | ... | Vietnam_6_2 | Vietnam_6_3 | Vietnam_7_1 | Vietnam_7_2 | Vietnam_7_3 | Vietnam_8_2 | Vietnam_9_1 | Vietnam_9_2 | Vietnam_9_3 | Vietnam_9_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Acidobacteria_Gp1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 18 | 0 | 38 |
| Acidobacteria_Gp10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 9 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 17 |
| Acidobacteria_Gp16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp18 | 0 | 0 | 0 | 2 | 2 | 5 | 0 | 15 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
tibble [52 x 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Taxa : chr [1:52] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Tanzania_2_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_2_10: num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_2_12: num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_2_2 : num [1:52] 0 0 0 0 0 2 0 0 0 0 ... $ Tanzania_2_3 : num [1:52] 0 0 0 0 0 2 0 0 0 0 ... $ Tanzania_2_6 : num [1:52] 0 0 0 0 0 5 0 0 0 1 ... $ Tanzania_2_7 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_2_9 : num [1:52] 0 0 0 0 0 15 0 0 0 0 ... $ Tanzania_3_2 : num [1:52] 0 0 0 0 0 2 0 0 0 0 ... $ Tanzania_3_3 : num [1:52] 0 0 0 0 0 1 0 0 0 0 ... $ Tanzania_3_5 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_4_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_4_4 : num [1:52] 0 0 0 0 0 1 0 0 0 0 ... $ Tanzania_4_5 : num [1:52] 1 0 1 0 0 0 0 0 32 0 ... $ Tanzania_4_6 : num [1:52] 0 0 0 0 0 0 0 0 2 1 ... $ Tanzania_4_7 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_5_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_5_3 : num [1:52] 0 0 0 0 0 1 0 0 0 0 ... $ Tanzania_5_4 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_5_5 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_6_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_6_5 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_6_7 : num [1:52] 0 0 0 0 0 1 0 0 1 1 ... $ Tanzania_6_8 : num [1:52] 0 0 0 0 0 0 0 0 0 1 ... $ Tanzania_9_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_9_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_9_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_9_4 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Tanzania_9_5 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_1_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_10_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_11_1 : num [1:52] 0 0 0 1 1 0 0 0 1 7 ... $ Vietnam_11_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_11_3 : num [1:52] 0 0 0 0 0 0 0 0 1 3 ... $ Vietnam_12_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_12_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_13_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_13_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_14_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_14_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_14_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_15_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_15_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_15_3 : num [1:52] 0 0 0 0 0 0 0 0 2 0 ... $ Vietnam_16_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_16_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_17_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_17_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_18_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_18_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_18_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_18_4 : num [1:52] 0 0 0 0 0 0 0 0 10 0 ... $ Vietnam_19_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_19_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_19_3 : num [1:52] 0 0 0 0 0 0 0 0 9 0 ... $ Vietnam_2_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_2_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_2_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_20_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_21_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_21_4 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_22_1 : num [1:52] 1 3 0 0 0 0 0 1 2 7 ... $ Vietnam_22_3 : num [1:52] 0 0 0 0 0 0 0 0 0 1 ... $ Vietnam_22_4 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_3_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_3_2 : num [1:52] 0 0 0 0 0 0 3 0 0 0 ... $ Vietnam_4_1 : num [1:52] 0 0 0 0 0 0 0 0 0 1 ... $ Vietnam_4_2 : num [1:52] 0 2 0 1 0 0 2 0 3 2 ... $ Vietnam_5_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_5_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_6_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_6_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_6_3 : num [1:52] 0 0 0 0 0 0 0 0 11 1 ... $ Vietnam_7_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_7_2 : num [1:52] 1 1 0 0 0 0 1 0 1 0 ... $ Vietnam_7_3 : num [1:52] 7 9 1 0 0 1 3 4 68 8 ... $ Vietnam_8_2 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_9_1 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_9_2 : num [1:52] 18 0 0 0 0 0 0 0 0 0 ... $ Vietnam_9_3 : num [1:52] 0 0 0 0 0 0 0 0 0 0 ... $ Vietnam_9_4 : num [1:52] 38 0 17 0 0 0 0 0 19 0 ... - attr(*, "spec")= .. cols( .. Taxa = col_character(), .. Tanzania_2_1 = col_double(), .. Tanzania_2_10 = col_double(), .. Tanzania_2_12 = col_double(), .. Tanzania_2_2 = col_double(), .. Tanzania_2_3 = col_double(), .. Tanzania_2_6 = col_double(), .. Tanzania_2_7 = col_double(), .. Tanzania_2_9 = col_double(), .. Tanzania_3_2 = col_double(), .. Tanzania_3_3 = col_double(), .. Tanzania_3_5 = col_double(), .. Tanzania_4_3 = col_double(), .. Tanzania_4_4 = col_double(), .. Tanzania_4_5 = col_double(), .. Tanzania_4_6 = col_double(), .. Tanzania_4_7 = col_double(), .. Tanzania_5_2 = col_double(), .. Tanzania_5_3 = col_double(), .. Tanzania_5_4 = col_double(), .. Tanzania_5_5 = col_double(), .. Tanzania_6_2 = col_double(), .. Tanzania_6_5 = col_double(), .. Tanzania_6_7 = col_double(), .. Tanzania_6_8 = col_double(), .. Tanzania_9_1 = col_double(), .. Tanzania_9_2 = col_double(), .. Tanzania_9_3 = col_double(), .. Tanzania_9_4 = col_double(), .. Tanzania_9_5 = col_double(), .. Vietnam_1_2 = col_double(), .. Vietnam_10_1 = col_double(), .. Vietnam_11_1 = col_double(), .. Vietnam_11_2 = col_double(), .. Vietnam_11_3 = col_double(), .. Vietnam_12_1 = col_double(), .. Vietnam_12_2 = col_double(), .. Vietnam_13_1 = col_double(), .. Vietnam_13_2 = col_double(), .. Vietnam_14_1 = col_double(), .. Vietnam_14_2 = col_double(), .. Vietnam_14_3 = col_double(), .. Vietnam_15_1 = col_double(), .. Vietnam_15_2 = col_double(), .. Vietnam_15_3 = col_double(), .. Vietnam_16_1 = col_double(), .. Vietnam_16_2 = col_double(), .. Vietnam_17_1 = col_double(), .. Vietnam_17_2 = col_double(), .. Vietnam_18_1 = col_double(), .. Vietnam_18_2 = col_double(), .. Vietnam_18_3 = col_double(), .. Vietnam_18_4 = col_double(), .. Vietnam_19_1 = col_double(), .. Vietnam_19_2 = col_double(), .. Vietnam_19_3 = col_double(), .. Vietnam_2_1 = col_double(), .. Vietnam_2_2 = col_double(), .. Vietnam_2_3 = col_double(), .. Vietnam_20_1 = col_double(), .. Vietnam_21_1 = col_double(), .. Vietnam_21_4 = col_double(), .. Vietnam_22_1 = col_double(), .. Vietnam_22_3 = col_double(), .. Vietnam_22_4 = col_double(), .. Vietnam_3_1 = col_double(), .. Vietnam_3_2 = col_double(), .. Vietnam_4_1 = col_double(), .. Vietnam_4_2 = col_double(), .. Vietnam_5_1 = col_double(), .. Vietnam_5_3 = col_double(), .. Vietnam_6_1 = col_double(), .. Vietnam_6_2 = col_double(), .. Vietnam_6_3 = col_double(), .. Vietnam_7_1 = col_double(), .. Vietnam_7_2 = col_double(), .. Vietnam_7_3 = col_double(), .. Vietnam_8_2 = col_double(), .. Vietnam_9_1 = col_double(), .. Vietnam_9_2 = col_double(), .. Vietnam_9_3 = col_double(), .. Vietnam_9_4 = col_double() .. )
Why tidy data?
Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that data structure can help this process along.
Tidy data has:
This seems pretty straight forward, and it is. It is the datasets you get that will not be straight forward. Having a map of where to take your data is helpful to unraveling its structure and getting it into a usable format.
The 5 most common problems with messy datasets are:
Fortunately there are some tools available to solve these problems.
The tidyverse is the universe of packages created by Hadley Wickham for data analysis. There are packages to help import, tidy, transform, model and visualize data. His packages are pretty popular, so he made a package to load all of his packages at once. This wrapper package is tidyverse. In this lesson series we have used dplyr, readr and readxl, and we will be using dplyr and tidyr today.
Hadley has a large fan-base. Someone even made a plot of Hadley using his own package, ggplot2.
Back to the normalverse...
# Let's review our data again
head(data)
| Taxa | Tanzania_2_1 | Tanzania_2_10 | Tanzania_2_12 | Tanzania_2_2 | Tanzania_2_3 | Tanzania_2_6 | Tanzania_2_7 | Tanzania_2_9 | Tanzania_3_2 | ... | Vietnam_6_2 | Vietnam_6_3 | Vietnam_7_1 | Vietnam_7_2 | Vietnam_7_3 | Vietnam_8_2 | Vietnam_9_1 | Vietnam_9_2 | Vietnam_9_3 | Vietnam_9_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Acidobacteria_Gp1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 18 | 0 | 38 |
| Acidobacteria_Gp10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 9 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 17 |
| Acidobacteria_Gp16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp18 | 0 | 0 | 0 | 2 | 2 | 5 | 0 | 15 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Which tidy data rules might our data frame break?
At first glance we can see that the column names are actually 3 different variables: 'Country', 'LatrineNumber', and 'Depth'. This information will likely be useful in our study, as we expect different bacteria at different depths, sites, and geographical locations. Each of these is a variable and should have its own separate column.
We could keep the column names as the sample names (as they are meaningful to the researcher) and add the extra variable columns, or we could make up sample names (ie. Sample_1) knowing that the information is not being lost, but rather stored in a more useful format.
Some of the Taxa also appear to have an additional variable of information (ie. _Gp1), but not all taxa have this information. We could also make a separate column for this information.
Each result is the same observational unit (i.e. relative abundances of bacteria), so having one table is fine.
tidyr¶tidyr is a package with functions that help us turn our 'messy' data into 'tidy' data. It has 2 major workhorse functions and 2 other tidying functions:
gather() - convert a data frame from wide to long formatspread() - convert a data frame from long to wide formatseparate() - split a column into 2 or more columns based on a string separatorunite() - merge 2 or more columns into 1 column using a string separatorgather() and spread() rely on key-value pairs to collapse or expand columns.
We've already loaded tidyverse which includes the tidyr package that the gather() function is from.
gather() your data from across columns¶We can use the gather() function to collect our columns. This will make our dataset 'long' instead of 'wide'.
gather(
data,
key = "key",
value = "value",
...,
na.rm = FALSE,
convert = FALSE,
factor_key = FALSE
)
We need to provide gather() with information on our new columns.
Note: This function has been retired in tidyr and replaced with pivot_longer() which has many more options allowing for more user-friendly and flexible conversion. For our purposes gather() will do just fine.
First let's pull our data into a data frame.
# Already loaded our data so no need to do it again
#data <- read_csv(...)
head(data)
# Note the width is 82 columns! 81 are data observations
| Taxa | Tanzania_2_1 | Tanzania_2_10 | Tanzania_2_12 | Tanzania_2_2 | Tanzania_2_3 | Tanzania_2_6 | Tanzania_2_7 | Tanzania_2_9 | Tanzania_3_2 | ... | Vietnam_6_2 | Vietnam_6_3 | Vietnam_7_1 | Vietnam_7_2 | Vietnam_7_3 | Vietnam_8_2 | Vietnam_9_1 | Vietnam_9_2 | Vietnam_9_3 | Vietnam_9_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Acidobacteria_Gp1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 18 | 0 | 38 |
| Acidobacteria_Gp10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 9 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 17 |
| Acidobacteria_Gp16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp18 | 0 | 0 | 0 | 2 | 2 | 5 | 0 | 15 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
# Remember we are piping "data" in as the first argument of gather
data %>%
gather(key=Site, value = OTUs, 2:82) %>%
head()
| Taxa | Site | OTUs |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp10 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp14 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp16 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp17 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp18 | Tanzania_2_1 | 0 |
#equivalent to all on a single line
data %>% gather(key = Site, value = OTUs, 2:82) %>% head()
#equivalent to cutting the first column
data %>% gather(Site, OTUs, -1) %>% head()
#equivalent to cutting the first column by name
data %>% gather(Site, OTUs, -Taxa) %>% head()
| Taxa | Site | OTUs |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp10 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp14 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp16 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp17 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp18 | Tanzania_2_1 | 0 |
| Taxa | Site | OTUs |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp10 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp14 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp16 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp17 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp18 | Tanzania_2_1 | 0 |
| Taxa | Site | OTUs |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp10 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp14 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp16 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp17 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp18 | Tanzania_2_1 | 0 |
In the above examples -1 means gather every column except the 1st, or gather every column except "Taxa". Taxa would still be retained as a column but its elements are not grouped in with 'sites' as an observation (i.e. we do not want 'Vietnam_9_4', 'Tanzania_2_9', and 'Clostridia' gathered into the same column).
Let's save the last variation into a data frame called gathered_data.
gathered_data <- data %>% gather(Site, OTUs, -Taxa) # long format
#head(gathered_data)
str(gathered_data)
tibble [4,212 x 3] (S3: tbl_df/tbl/data.frame) $ Taxa: chr [1:4212] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Site: chr [1:4212] "Tanzania_2_1" "Tanzania_2_1" "Tanzania_2_1" "Tanzania_2_1" ... $ OTUs: num [1:4212] 0 0 0 0 0 0 0 0 0 0 ...
gather()?¶Note how the dimensions of your dataframe have changed relative to data. Instead of 52 rows and 82 columns, we now have a data frame with 4,212 rows and 3 columns (which is the 81 columns we gathered x 52 rows). gathered_data is now in a long format instead of wide.
separate() can split variables into multiple columns by specifying a text delimiter¶Notice the information contained in our "Site" data? It is a combination of 3 values: country, latrine number, and depth information all separated with "_" between each.
We can use the separate() function to retrieve the Country, Latrine_Number, and Depth information from our Site column. separate() takes in your dataframe, the name of the column to be split, the names of your new columns, and the character that you want to split the columns by (in this case an underscore). Note that the default is to remove your original column - you can keep it by adding the argument remove = FALSE, keeping in mind that you now have redundant data.
separate(
data,
col,
into,
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE,
extra = "warn",
fill = "warn",
...
)
We need to provide separate() with information to help split our information.
separate() how to break up the information in each row of col.split_gathered_data <- gathered_data %>%
separate(Site, c("Country", "Latrine_Number", "Depth"),
sep = "_")
head(split_gathered_data)
#equivalent to setting the parameter "remove" to TRUE, although this is the default behaviour
gathered_data %>%
separate(Site, c("Country", "Latrine_Number", "Depth"),
sep = "_", remove=FALSE) %>% head()
#head(split_gathered_data)
| Taxa | Country | Latrine_Number | Depth | OTUs |
|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp10 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp14 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp16 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp17 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp18 | Tanzania | 2 | 1 | 0 |
| Taxa | Site | Country | Latrine_Number | Depth | OTUs |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania_2_1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp10 | Tanzania_2_1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp14 | Tanzania_2_1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp16 | Tanzania_2_1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp17 | Tanzania_2_1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria_Gp18 | Tanzania_2_1 | Tanzania | 2 | 1 | 0 |
We may also want to do this for the 'Group' of Acidobacteria. Try the code, but do not save the answer in a variable.
# practice splitting "Taxa" into two columns: Taxa and Group using the separate command
split_gathered_data %>%
separate(Taxa, c("Taxa", "Group"), sep = "_Gp") %>%
head(10)
Warning message: "Expected 2 pieces. Missing pieces filled with `NA` in 3078 rows [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, ...]."
| Taxa | Group | Country | Latrine_Number | Depth | OTUs |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> |
| Acidobacteria | 1 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 10 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 14 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 16 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 17 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 18 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 21 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 22 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 3 | Tanzania | 2 | 1 | 0 |
| Acidobacteria | 4 | Tanzania | 2 | 1 | 0 |
Notice that we have triggered a warning from R that it has filled in 'NA' for the bacteria that did not have groups. Also see that I chose to split Taxa using '_Gp' since I did not need 'Gp'.
glimpse() and str() information for our split data¶Use the glimpse() function to look at the type of each variable in our new data frame, split_gathered_data. Are those the types you expected? Why or why not? How is glimpse() different from the str() function?
print ("glimpse of our split data")
glimpse(split_gathered_data)
# Latrine_Number and Depth are characters, not numeric or integer because they were derived from a character string
# Glimpse is customized for tibbles and data frames and shows as much data as possible
print ("structure of our split data")
str(split_gathered_data)
[1] "glimpse of our split data" Rows: 4,212 Columns: 5 $ Taxa <chr> "Acidobacteria_Gp1", "Acidobacteria_Gp10", "Acidobac... $ Country <chr> "Tanzania", "Tanzania", "Tanzania", "Tanzania", "Tan... $ Latrine_Number <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2... $ Depth <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1... $ OTUs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 110, 11, 2... [1] "structure of our split data" tibble [4,212 x 5] (S3: tbl_df/tbl/data.frame) $ Taxa : chr [1:4212] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Country : chr [1:4212] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: chr [1:4212] "2" "2" "2" "2" ... $ Depth : chr [1:4212] "1" "1" "1" "1" ... $ OTUs : num [1:4212] 0 0 0 0 0 0 0 0 0 0 ...
group_by() to organize our dataset¶There is a useful function group_by() that you can use to group variables or sets of variables together. This is useful for calculations and plotting on subsets of your data without having to turn your variables into factors.
Say I wanted to look at a combination of Country and Well Depth. While visually, you wouldn't notice any changes to your data frame, if you look at the structure it will now be a 'grouped_df'. There are 15 groupings resulting from Country and Depth. After I have performed my desired operation, I can return my data frame to its original structure by calling ungroup().
First we will examine the structure of grouped and ungrouped output without any additional operations.
# Look at the structure of our data
print("split_gathered_data in native format")
split_gathered_data %>% str()
# Group and then look at the structure of our data
print("split_gathered_data in a grouped data frame")
split_gathered_data %>% group_by(Country, Depth) %>% str()
[1] "split_gathered_data in native format" tibble [4,212 x 5] (S3: tbl_df/tbl/data.frame) $ Taxa : chr [1:4212] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Country : chr [1:4212] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: chr [1:4212] "2" "2" "2" "2" ... $ Depth : chr [1:4212] "1" "1" "1" "1" ... $ OTUs : num [1:4212] 0 0 0 0 0 0 0 0 0 0 ... [1] "split_gathered_data in a grouped data frame" tibble [4,212 x 5] (S3: grouped_df/tbl_df/tbl/data.frame) $ Taxa : chr [1:4212] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Country : chr [1:4212] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: chr [1:4212] "2" "2" "2" "2" ... $ Depth : chr [1:4212] "1" "1" "1" "1" ... $ OTUs : num [1:4212] 0 0 0 0 0 0 0 0 0 0 ... - attr(*, "groups")= tibble [15 x 3] (S3: tbl_df/tbl/data.frame) ..$ Country: chr [1:15] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... ..$ Depth : chr [1:15] "1" "10" "12" "2" ... ..$ .rows : list<int> [1:15] .. ..$ : int [1:104] 1 2 3 4 5 6 7 8 9 10 ... .. ..$ : int [1:52] 53 54 55 56 57 58 59 60 61 62 ... .. ..$ : int [1:52] 105 106 107 108 109 110 111 112 113 114 ... .. ..$ : int [1:260] 157 158 159 160 161 162 163 164 165 166 ... .. ..$ : int [1:260] 209 210 211 212 213 214 215 216 217 218 ... .. ..$ : int [1:156] 625 626 627 628 629 630 631 632 633 634 ... .. ..$ : int [1:260] 521 522 523 524 525 526 527 528 529 530 ... .. ..$ : int [1:104] 261 262 263 264 265 266 267 268 269 270 ... .. ..$ : int [1:156] 313 314 315 316 317 318 319 320 321 322 ... .. ..$ : int [1:52] 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 ... .. ..$ : int [1:52] 365 366 367 368 369 370 371 372 373 374 ... .. ..$ : int [1:1040] 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 ... .. ..$ : int [1:884] 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 ... .. ..$ : int [1:572] 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 ... .. ..$ : int [1:208] 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 ... .. ..@ ptype: int(0) ..- attr(*, ".drop")= logi TRUE
# Group and then ungroup our data
print("split_gathered_data grouped and then ungrouped")
split_gathered_data %>% group_by(Country, Depth) %>% ungroup() %>% str()
[1] "split_gathered_data grouped and then ungrouped" tibble [4,212 x 5] (S3: tbl_df/tbl/data.frame) $ Taxa : chr [1:4212] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Country : chr [1:4212] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: chr [1:4212] "2" "2" "2" "2" ... $ Depth : chr [1:4212] "1" "1" "1" "1" ... $ OTUs : num [1:4212] 0 0 0 0 0 0 0 0 0 0 ...
group_by() and summarize() functions to produce sensible stat data¶Now we can see an example of how group_by() in action with summarize() (or summarise()), can easily calculate summary statistics for groups of data. Whereas in our messy data frame it was difficult to do calculations based on Country, Well Number or Latrine Depth, this is now an easy task. Let's get the mean, median, standard deviation and maximum value for the number of OTUs collected in Tanzania vs Vietnam.
# group and then summarise data
split_gathered_data %>%
group_by(Country) %>%
summarize(mean = mean(OTUs),
median = median(OTUs),
sd = sd(OTUs),
maximum = max(OTUs))
`summarise()` ungrouping output (override with `.groups` argument)
| Country | mean | median | sd | maximum |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| Tanzania | 122.2195 | 0 | 896.1408 | 17471 |
| Vietnam | 186.0921 | 0 | 863.3589 | 17572 |
In dealing with grouped data, we no longer have to grab a Country by subsetting or using helper functions to grab letters from their names. group_by() produces that modified data frame which recognizes that we have 2 countries and will perform calculations for both of them.
Now that we have tidy data, let's proceed to answering our questions:
What steps do we need to take to answer this question? Before we dive in, let's consider what we need to answer this question.
# Group and then summarize the data before sorting in descending order
split_gathered_data %>%
group_by(Country, Latrine_Number) %>%
summarise(mean = mean(OTUs)) %>%
arrange(desc(mean))
`summarise()` regrouping output by 'Country' (override with `.groups` argument)
| Country | Latrine_Number | mean |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Vietnam | 3 | 445.50000 |
| Vietnam | 17 | 414.63462 |
| Vietnam | 16 | 402.59615 |
| Vietnam | 11 | 338.28205 |
| Vietnam | 20 | 335.30769 |
| Vietnam | 10 | 328.19231 |
| Vietnam | 8 | 236.46154 |
| Vietnam | 15 | 206.99359 |
| Vietnam | 18 | 201.49038 |
| Tanzania | 2 | 191.37019 |
| Vietnam | 12 | 188.07692 |
| Vietnam | 6 | 172.44231 |
| Vietnam | 5 | 158.63462 |
| Vietnam | 7 | 151.16026 |
| Vietnam | 19 | 150.12821 |
| Vietnam | 21 | 131.98077 |
| Tanzania | 5 | 130.46635 |
| Tanzania | 4 | 129.81923 |
| Vietnam | 1 | 119.80769 |
| Vietnam | 22 | 119.35897 |
| Vietnam | 13 | 98.49038 |
| Tanzania | 9 | 95.17308 |
| Vietnam | 9 | 93.39904 |
| Vietnam | 14 | 60.14744 |
| Tanzania | 6 | 56.35577 |
| Vietnam | 2 | 52.57051 |
| Tanzania | 3 | 47.05128 |
| Vietnam | 4 | 22.19231 |
Recall our original data set had three variables that were mashed together into a Site (Country_LatrineNo_Depth) and answering our question would have been difficult. With tidy data format we have split a combined variable column into three, making latrine_Number a variable. This allowed us to simply group our data to perform our mean calculation and get an answer.
This is a bit of variation on our first question. What kind of data do we need?
# Filter your data then group it before summarizing
split_gathered_data %>%
filter(Taxa == "Clostridia") %>%
group_by(Country, Taxa) %>% # Why are we grouping with Taxa as well?
summarize(sum = sum(OTUs),
mean = mean(OTUs),
sd = sd(OTUs)) %>%
arrange(desc(sum))
`summarise()` regrouping output by 'Country' (override with `.groups` argument)
| Country | Taxa | sum | mean | sd |
|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| Vietnam | Clostridia | 176079 | 3386.135 | 3735.849 |
| Tanzania | Clostridia | 101217 | 3490.241 | 4688.513 |
Again, being able to filter by Taxa and group by Country (as an isolated variable) helps a lot. With dplyr syntax we can perform all data manipulations and calculations in a code block that is readable. Note that we grouped by both Country and Taxa. This was done more for visual purposes to confirm that Clostridia was the only taxa coming out of our filter. We could easily have done group_by(Country) with the same result. Our final output, however, would not include the Taxa column.
Lets break down the question again into the basic components!
n() command from dplyrto count a group outsplit_gathered_data %>%
group_by(Country, Latrine_Number, Depth) %>%
filter(OTUs != 0) %>%
summarize(count = n()) %>%
arrange(desc(count))
`summarise()` regrouping output by 'Country', 'Latrine_Number' (override with `.groups` argument)
| Country | Latrine_Number | Depth | count |
|---|---|---|---|
| <chr> | <chr> | <chr> | <int> |
| Vietnam | 7 | 3 | 36 |
| Vietnam | 22 | 1 | 31 |
| Vietnam | 7 | 2 | 29 |
| Tanzania | 2 | 6 | 27 |
| Vietnam | 11 | 1 | 27 |
| Vietnam | 11 | 3 | 27 |
| Tanzania | 2 | 9 | 26 |
| Vietnam | 15 | 2 | 26 |
| Vietnam | 15 | 3 | 26 |
| Vietnam | 18 | 4 | 26 |
| Vietnam | 4 | 2 | 26 |
| Tanzania | 4 | 5 | 25 |
| Vietnam | 6 | 3 | 25 |
| Vietnam | 9 | 4 | 25 |
| Tanzania | 4 | 4 | 24 |
| Tanzania | 4 | 6 | 23 |
| Tanzania | 6 | 7 | 23 |
| Tanzania | 5 | 2 | 22 |
| Vietnam | 22 | 4 | 22 |
| Vietnam | 5 | 3 | 22 |
| Tanzania | 3 | 2 | 21 |
| Vietnam | 19 | 3 | 21 |
| Vietnam | 3 | 2 | 21 |
| Vietnam | 9 | 1 | 21 |
| Tanzania | 2 | 2 | 20 |
| Tanzania | 5 | 3 | 20 |
| Tanzania | 6 | 8 | 20 |
| Vietnam | 14 | 1 | 20 |
| Vietnam | 15 | 1 | 20 |
| Vietnam | 16 | 2 | 20 |
| ... | ... | ... | ... |
| Vietnam | 4 | 1 | 18 |
| Vietnam | 9 | 2 | 18 |
| Vietnam | 12 | 1 | 17 |
| Vietnam | 16 | 1 | 17 |
| Vietnam | 2 | 3 | 17 |
| Vietnam | 21 | 1 | 17 |
| Vietnam | 7 | 1 | 17 |
| Tanzania | 6 | 5 | 16 |
| Tanzania | 9 | 2 | 16 |
| Vietnam | 12 | 2 | 16 |
| Vietnam | 13 | 1 | 16 |
| Vietnam | 18 | 1 | 16 |
| Tanzania | 2 | 1 | 15 |
| Tanzania | 3 | 3 | 15 |
| Tanzania | 4 | 7 | 15 |
| Vietnam | 5 | 1 | 15 |
| Vietnam | 1 | 2 | 14 |
| Tanzania | 9 | 3 | 13 |
| Vietnam | 2 | 1 | 12 |
| Vietnam | 9 | 3 | 12 |
| Tanzania | 5 | 4 | 11 |
| Tanzania | 5 | 5 | 11 |
| Tanzania | 3 | 5 | 10 |
| Tanzania | 4 | 3 | 9 |
| Tanzania | 6 | 2 | 9 |
| Vietnam | 14 | 2 | 9 |
| Tanzania | 2 | 10 | 7 |
| Tanzania | 2 | 7 | 7 |
| Tanzania | 2 | 12 | 6 |
| Tanzania | 9 | 1 | 2 |
Since we can group by the 3 variables that were in the Site name, there is no disadvantage to having our data in tidy format compared to our original wide data frame. However now we are able to filter for non-zero OTUs, which was impossible in the wide format. Since we know from earlier in the lesson that each Taxa is only represented once for each site, we only have to count and order the number of observations to get our answer.
To get data back into its original format, there are reciprocal functions in the tidyr package, making it possible to switch between wide and long formats.
Fair question: But you've just been telling me how great the 'long' format is?!?! Why would I want the wide format again???
Honest answer: Note that our original data frame was 52 rows and expanded to 4212 rows in the long format. When you have, say, a genomics dataset you might end up with 6,000 rows expanding to 600,000 rows. You probably want to do your calculations and switch back to the more "human writeable/readable" format. Sure, I can save a data frame with 600,000 rows, but I can't really send it to anyone because spreadsheet software such as Excel might crash while trying to open the file.
unite() your columns back again¶unite(
data,
col,
...,
sep = "_",
remove = TRUE,
na.rm = FALSE
)
The opposite of separate(), we need to provide unite() with information to help consolidate our information.
unite() what kind of character to put between recombined data values going into col.Collapse Country, Latrine_Number and Depth back into one variable, 'Site', using the unite() function. Store the output in a data frame called united_data.
# Put our three variables Country, Latrine_Number, and Depth back together
united_data = split_gathered_data %>%
unite("Site", c("Country", "Latrine_Number", "Depth"), sep="_")
united_data
| Taxa | Site | OTUs |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Acidobacteria_Gp1 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp10 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp14 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp16 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp17 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp18 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp21 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp22 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp3 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp4 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp5 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp6 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp7 | Tanzania_2_1 | 0 |
| Acidobacteria_Gp9 | Tanzania_2_1 | 0 |
| Actinobacteria | Tanzania_2_1 | 110 |
| Alphaproteobacteria | Tanzania_2_1 | 11 |
| Anaerolineae | Tanzania_2_1 | 2 |
| Bacilli | Tanzania_2_1 | 19 |
| Bacteroidia | Tanzania_2_1 | 1547 |
| Betaproteobacteria | Tanzania_2_1 | 2 |
| Caldilineae | Tanzania_2_1 | 0 |
| Chlamydiae | Tanzania_2_1 | 0 |
| Chloroflexi | Tanzania_2_1 | 0 |
| Chrysiogenetes | Tanzania_2_1 | 0 |
| Clostridia | Tanzania_2_1 | 6213 |
| Cyanobacteria | Tanzania_2_1 | 0 |
| Dehalococcoidetes | Tanzania_2_1 | 0 |
| Deinococci | Tanzania_2_1 | 0 |
| Deltaproteobacteria | Tanzania_2_1 | 0 |
| Epsilonproteobacteria | Tanzania_2_1 | 0 |
| ... | ... | ... |
| Chloroflexi | Vietnam_9_4 | 0 |
| Chrysiogenetes | Vietnam_9_4 | 0 |
| Clostridia | Vietnam_9_4 | 3051 |
| Cyanobacteria | Vietnam_9_4 | 1 |
| Dehalococcoidetes | Vietnam_9_4 | 0 |
| Deinococci | Vietnam_9_4 | 88 |
| Deltaproteobacteria | Vietnam_9_4 | 11 |
| Epsilonproteobacteria | Vietnam_9_4 | 0 |
| Erysipelotrichi | Vietnam_9_4 | 43 |
| Fibrobacteria | Vietnam_9_4 | 0 |
| Flavobacteria | Vietnam_9_4 | 56 |
| Fusobacteria | Vietnam_9_4 | 0 |
| Gammaproteobacteria | Vietnam_9_4 | 559 |
| Gemmatimonadetes | Vietnam_9_4 | 38 |
| Holophagae | Vietnam_9_4 | 0 |
| Lentisphaeria | Vietnam_9_4 | 0 |
| Methanobacteria | Vietnam_9_4 | 0 |
| Methanomicrobia | Vietnam_9_4 | 0 |
| Mollicutes | Vietnam_9_4 | 3 |
| Nitrospira | Vietnam_9_4 | 0 |
| Opitutae | Vietnam_9_4 | 3 |
| Planctomycetacia | Vietnam_9_4 | 95 |
| Sphingobacteria | Vietnam_9_4 | 146 |
| Spirochaetes | Vietnam_9_4 | 18 |
| Subdivision3 | Vietnam_9_4 | 0 |
| Synergistia | Vietnam_9_4 | 0 |
| Thermomicrobia | Vietnam_9_4 | 35 |
| Thermoplasmata | Vietnam_9_4 | 0 |
| Thermotogae | Vietnam_9_4 | 0 |
| Unknown | Vietnam_9_4 | 303 |
spread() to convert your data from long to wide format¶spread(
data,
key,
value,
fill = NA,
convert = FALSE,
drop = TRUE,
sep = NULL
)
The opposite of gather(), we need to provide spread() with information to help consolidate our information. This can be tricky to conceptualize BUT the goal is to consolidate row entries based on specific columns that we do NOT name.
Note: Much like gather(), the spread() function has been retired in tidyr and replaced with pivot_wider() which has many more options allowing for more user-friendly and flexible conversion. For our purposes spread() will do just fine.
Use the spread() function to turn unite_data into the wide shape of our original dataset. Save the output into a data frame called spread_data.
# Remember that we still have 3 variables: Taxa, Site, OTUs. We want to reunite each site into a column of OTUs
spread_data <- united_data %>%
spread(key = Site, value = OTUs)
head(spread_data)
| Taxa | Tanzania_2_1 | Tanzania_2_10 | Tanzania_2_12 | Tanzania_2_2 | Tanzania_2_3 | Tanzania_2_6 | Tanzania_2_7 | Tanzania_2_9 | Tanzania_3_2 | ... | Vietnam_6_2 | Vietnam_6_3 | Vietnam_7_1 | Vietnam_7_2 | Vietnam_7_3 | Vietnam_8_2 | Vietnam_9_1 | Vietnam_9_2 | Vietnam_9_3 | Vietnam_9_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Acidobacteria_Gp1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 18 | 0 | 38 |
| Acidobacteria_Gp10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 9 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 17 |
| Acidobacteria_Gp16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Acidobacteria_Gp18 | 0 | 0 | 0 | 2 | 2 | 5 | 0 | 15 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
We'll use the standard write.csv command to save our results to our data folder.
ggplot(split_gathered_data, aes(x=Depth)) +
geom_bar()
getwd()
write.csv(x=spread_data, file="data/taxa_pitlatrine_wide.ver2.csv", row.names = FALSE, quote = FALSE)
united_data_no_zeros <- united_data %>%
filter(OTUs != 0) %>%
arrange(desc(OTUs)) %>%
.[1:20, ]
## Don't forget what that "." means. Kind of a short-hand for "this"
united_data_no_zeros
| Taxa | Site | OTUs |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Clostridia | Vietnam_16_2 | 17572 |
| Clostridia | Tanzania_2_9 | 17471 |
| Clostridia | Vietnam_17_2 | 13875 |
| Clostridia | Tanzania_2_6 | 12169 |
| Bacteroidia | Vietnam_17_1 | 11392 |
| Clostridia | Vietnam_10_1 | 11180 |
| Clostridia | Tanzania_2_3 | 10944 |
| Unknown | Tanzania_9_2 | 10624 |
| Clostridia | Tanzania_4_4 | 10545 |
| Clostridia | Vietnam_16_1 | 10043 |
| Clostridia | Tanzania_2_2 | 8999 |
| Clostridia | Tanzania_5_3 | 8981 |
| Alphaproteobacteria | Vietnam_3_2 | 8660 |
| Clostridia | Vietnam_15_1 | 8244 |
| Clostridia | Vietnam_17_1 | 7722 |
| Betaproteobacteria | Tanzania_4_3 | 7710 |
| Clostridia | Vietnam_18_1 | 7481 |
| Alphaproteobacteria | Vietnam_3_1 | 7322 |
| Clostridia | Vietnam_18_4 | 6246 |
| Clostridia | Tanzania_2_1 | 6213 |
microbes_wide.csv to long format¶Reshape microbes_wide.csv into a long format version called microbes_long. Make sure the columns are in the following order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".
Once you're done, write microbes_long as microbes.csv.
Nots about microbes_wide.csv
# Import your data file and look at it
microbes_wide <- read.csv("data/microbes_wide.csv")
head(microbes_wide)
| ASV | taxa | brackish_pyrene_treatment_1 | brackish_pyrene_treatment_2 | brackish_pyrene_treatment_3 | brackish_substrate.free_control_1 | brackish_substrate.free_control_2 | brackish_substrate.free_control_3 | brackish_substrate.free_NA_1 | brackish_toluene_treatment_1 | ... | fresh_toluene_treatment_2 | fresh_toluene_treatment_3 | saline_pyrene_treatment_1 | saline_pyrene_treatment_2 | saline_pyrene_treatment_3 | saline_substrate.free_control_1 | saline_substrate.free_control_2 | saline_toluene_treatment_1 | saline_toluene_treatment_2 | saline_toluene_treatment_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| 1 | GAATAGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGACTTGACATCCTAAAAACATCTAAGAGATTAGAAAGTGCTAGTTTACTAGAATTTAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCACGTGTTTAGTTGCTAACAGCTCGGCTGAGCACTCTAAACAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT | Bacteria_Proteobacteria_Epsilonproteobacteria_Campylobacterales_Campylobacteraceae_Campylobacter | 0.0 | 0.00 | 0 | 0 | 0 | 0.03 | NA | 0 | ... | 0 | 0 | 0 | 0.0 | 0 | 0.00 | 0 | 0 | 0 | 0 |
| 2 | GAATAGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGATTGAAATTTAGATGTTGGCAGATGAGAGTTTGCTTTCCTTTGGGACATCTAAGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCGCGTCGATAGTTACTAACAGGTTAAGCTGAGGACTCTATCGAGACAGCCGTCGTAAGACGTGAGGAAGGGGCGGATGACGTCAAATCAGCACGGCCCTTACATCCGGGGCGACACACGTGTTACAATGGCAGGGACAAAGGGAAGCGACATGGTGACATGAAGCGGATCTCCAAACCCTGTCCCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA | Bacteria_Bacteroidetes_Bacteroidia_Bacteroidales_Porphyromonadaceae_Porphyromonas | 0.0 | 0.03 | 0 | 0 | 0 | 0.00 | NA | 0 | ... | 0 | 0 | 0 | 0.0 | 0 | 0.00 | 0 | 0 | 0 | 0 |
| 3 | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGCTCTTGACATCTTCGGAACTTTCTAGAGATAGATTGGTGCCTTCGGGAACCGAATGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCACGTAATGGTGGGAACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTACTACAATGGTGGGGACAGAGGGTCGCAATGCCGCGAGGCGGAGCCAATCCCAGAAACCCTATCTTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA | Bacteria_Proteobacteria_Gammaproteobacteria_Xanthomonadales_Xanthomonadaceae_Arenimonas | 0.0 | 0.00 | 0 | 0 | 0 | 0.00 | NA | 0 | ... | 0 | 0 | 0 | 0.0 | 0 | 0.00 | 0 | 0 | 0 | 0 |
| 4 | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGCCCTTGACATCCTCGGAATCCTTCAGAGATGAGGGAGTGCCTTCGGGAACCGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCCTAGTTGCCAGCATTTCGGATGGGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTATGGGCAGGGCTACACACGTGCTACAATGGCCGGTACAGTGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAAAAAGCCGGTCGTAGTCCGAATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA | Bacteria_Proteobacteria_Gammaproteobacteria_Chromatiales_Ectothiorhodospiraceae_Ectothiorhodospira | 0.1 | 0.00 | 0 | 0 | 0 | 0.00 | NA | 0 | ... | 0 | 0 | 0 | 0.1 | 0 | 0.00 | 0 | 0 | 0 | 0 |
| 5 | GAATAGGCGGGGACCCGCACAAGCAGCGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCTGAGAGATCAGACTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTGGTTAGTTGCCATCATTTAGTTGGGCACTCTAAGCAGACTGCCGTGGATAACACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTCTGAACAGAGGGTTGCGAAACCGCGAGGTGAAGCTAATCCCTTAAAACAGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTTGGAGTTGCTAGTAATCGCAGATCAGAATGCTGCGGTGAATGCGTTCCCGGGTCTTGCACA | Bacteria_Firmicutes_Clostridia_Clostridiales_Eubacteriaceae_Acetobacterium | 0.0 | 0.20 | 0 | 0 | 0 | 0.00 | NA | 0 | ... | 0 | 0 | 0 | 0.0 | 0 | 0.03 | 0 | 0 | 0 | 0 |
| 6 | GAATAGGCGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGCCTTGACATCCTTAGAATCTTTTAGAGATAAGAGAGTGCCTAGTTTACTAGGAGCTAAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATCATTAGTTGCTAACAGTTAGGCTGAGAACTCTAATGAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT | Bacteria_Proteobacteria_Epsilonproteobacteria_Campylobacterales_Campylobacteraceae_Arcobacter | 0.0 | 0.00 | 0 | 0 | 0 | 0.00 | NA | 0 | ... | 0 | 0 | 0 | 0.0 | 0 | 0.00 | 0 | 0 | 0 | 0 |
# Proceed to break the different observations and information apart
microbes_long <- microbes_wide %>%
gather(key = sample_name, value = abundance, -ASV, -taxa) %>%
separate
head(microbes_long)
colnames(microbes_long)
| ASV | taxa | sample_name | abundance | |
|---|---|---|---|---|
| <chr> | <chr> | <chr> | <dbl> | |
| 1 | GAATAGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGACTTGACATCCTAAAAACATCTAAGAGATTAGAAAGTGCTAGTTTACTAGAATTTAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCACGTGTTTAGTTGCTAACAGCTCGGCTGAGCACTCTAAACAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT | Bacteria_Proteobacteria_Epsilonproteobacteria_Campylobacterales_Campylobacteraceae_Campylobacter | brackish_pyrene_treatment_1 | 0.0 |
| 2 | GAATAGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGATTGAAATTTAGATGTTGGCAGATGAGAGTTTGCTTTCCTTTGGGACATCTAAGTAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCGCGTCGATAGTTACTAACAGGTTAAGCTGAGGACTCTATCGAGACAGCCGTCGTAAGACGTGAGGAAGGGGCGGATGACGTCAAATCAGCACGGCCCTTACATCCGGGGCGACACACGTGTTACAATGGCAGGGACAAAGGGAAGCGACATGGTGACATGAAGCGGATCTCCAAACCCTGTCCCAGTTCGGATCGGAGTCTGCAACTCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGTACA | Bacteria_Bacteroidetes_Bacteroidia_Bacteroidales_Porphyromonadaceae_Porphyromonas | brackish_pyrene_treatment_1 | 0.0 |
| 3 | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAGCTCTTGACATCTTCGGAACTTTCTAGAGATAGATTGGTGCCTTCGGGAACCGAATGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCACGTAATGGTGGGAACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTACTACAATGGTGGGGACAGAGGGTCGCAATGCCGCGAGGCGGAGCCAATCCCAGAAACCCTATCTTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA | Bacteria_Proteobacteria_Gammaproteobacteria_Xanthomonadales_Xanthomonadaceae_Arenimonas | brackish_pyrene_treatment_1 | 0.0 |
| 4 | GAATAGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGCCCTTGACATCCTCGGAATCCTTCAGAGATGAGGGAGTGCCTTCGGGAACCGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCCTAGTTGCCAGCATTTCGGATGGGAACTCTAGGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTATGGGCAGGGCTACACACGTGCTACAATGGCCGGTACAGTGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAAAAAGCCGGTCGTAGTCCGAATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCATTGCCGCGGTGAATACGTTCCCGGGCCTTGTACA | Bacteria_Proteobacteria_Gammaproteobacteria_Chromatiales_Ectothiorhodospiraceae_Ectothiorhodospira | brackish_pyrene_treatment_1 | 0.1 |
| 5 | GAATAGGCGGGGACCCGCACAAGCAGCGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCTCTGACAATCTGAGAGATCAGACTTTCCCTTCGGGGACAGAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTGGTTAGTTGCCATCATTTAGTTGGGCACTCTAAGCAGACTGCCGTGGATAACACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTCTGAACAGAGGGTTGCGAAACCGCGAGGTGAAGCTAATCCCTTAAAACAGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTTGGAGTTGCTAGTAATCGCAGATCAGAATGCTGCGGTGAATGCGTTCCCGGGTCTTGCACA | Bacteria_Firmicutes_Clostridia_Clostridiales_Eubacteriaceae_Acetobacterium | brackish_pyrene_treatment_1 | 0.0 |
| 6 | GAATAGGCGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGATACGCGAAGAACCTTACCTGGCCTTGACATCCTTAGAATCTTTTAGAGATAAGAGAGTGCCTAGTTTACTAGGAGCTAAGTGACAGGTGCTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATCATTAGTTGCTAACAGTTAGGCTGAGAACTCTAATGAGACTGCCTTCGTAAGGAGGAGGAAGGTGAGGACGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGGAAGGACAGTGAGACGCGATACCGCGAGGTGGAGCAAATCTATAAACCTTCTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAACATGAAGCTGGAATCGCTAGTAATCGTAAATCAGCAATGTTACGGTGAATACGTTCCCGGGTCTTGTACT | Bacteria_Proteobacteria_Epsilonproteobacteria_Campylobacterales_Campylobacteraceae_Arcobacter | brackish_pyrene_treatment_1 | 0.0 |
# Our column are in the wrong order! We need to shuffle "salinity", "compound", "group", and "replicate" towards the left
# relocate columns to match this order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".
microbes_long <- microbes_long[ , c(...)]
colnames(microbes_long)
getwd()
gapminder_wide to a long format¶Read in the gapminder_wide.csv. What rules of tidy data does it break? Transform the dataset to the format below.
| continent | country | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.4453 |
| Asia | Afghanistan | 1957 | 30.332 | 9240934 | 820.8530 |
| Asia | Afghanistan | 1962 | 31.997 | 10267083 | 853.1007 |
| Asia | Afghanistan | 1967 | 34.020 | 11537966 | 836.1971 |
| Asia | Afghanistan | 1972 | 36.088 | 13079460 | 739.9811 |
| Asia | Afghanistan | 1977 | 38.438 | 14880372 | 786.1134 |
gapminder_longReshape gapminder_wide into a long, cleansed format. BEFORE you start writing any code, IDENTIFY the formatting deficiencies and then PLAN ahead what you want to achieve.
library(gapminder)
# read gapminder_wide.csv
gapminder_wide <- read.csv(...)
head(gapminder_wide)
Observations appear to be split into to visible categories based on continent and country
So we are seeing multiple observations per column AND multiple observation types in the table (GPP, life expectancy and population)! What a mess!!
# Let's do a bulk conversion into long format
gapminder_long <- gapminder_wide %>%
head(gapminder_long)
We've generated a long format table but we still have obs_type as a variable which traps two kind of information
We need to break that up into separate variables.
# We need to break the multiple observations from a column into multiple columns
# Divide obs_type into "obs_type" and "year"
gapminder_long <- gapminder_long %>%
...; head(gapminder_long)
str(gapminder_long)
spread() to widen your table based on a mixed category observation¶We've successfully separate our dual-variable in obs_type and year. However the three observation types are still trapped in the same table and column! We'll have to take a small step back before we can take a step forward. Send it back into slightly wider format!
# Spread data from obs_type as its contents should be their own variables
gapminder_long <- gapminder_long %>%
...; head(gapminder_long)
str(gapminder_long)
select() to rearrange your columns!¶Nearly there! Let's review what we want as a final data table.
| continent | country | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.4453 |
| Asia | Afghanistan | 1957 | 30.332 | 9240934 | 820.8530 |
| Asia | Afghanistan | 1962 | 31.997 | 10267083 | 853.1007 |
| Asia | Afghanistan | 1967 | 34.020 | 11537966 | 836.1971 |
| Asia | Afghanistan | 1972 | 36.088 | 13079460 | 739.9811 |
| Asia | Afghanistan | 1977 | 38.438 | 14880372 | 786.1134 |
# Rearrange columns (optional)
# Don't forget some of our commands from lecture 2! ie 2.3.0 select()
gapminder_long <- gapminder_long %>%
; head(gapminder_long)
getwd()
Good job!
https://github.com/wmhall/tidyr_lesson/blob/master/tidyr_lesson.md
http://vita.had.co.nz/papers/tidy-data.pdf
https://thinkr.fr/tidyverse-hadleyverse/
http://stat545.com/bit001_dplyr-cheatsheet.html
http://dplyr.tidyverse.org/articles/two-table.html
Soon after the end of this lecture, a homework will be available for you on DataCamp. You will have until 1200 hours on Thursday, March 11th, to submit your assignment (right before the next lecture). This is a pass-fail assignment where you need to achieve a least 75% out of the total points possible to pass.
Please install the following packages from the conda-forge channel for next time:
tidyverse (ggplot2, tidyr, dplyr, forcats) These should already be installed from a previous installation of tidyverse
RColorBrewer
viridis
gridExtra
ggrepel
ggbeeswarm
ggthemes
ggpubr
UpSetR
Thanks for coming!
